MV6D: Multi-View 6D Pose Estimation on RGB-D Frames Using a Deep Point-wise Voting Network
Estimating 6D poses of objects is an essential computer vision task. However,
most conventional approaches rely on camera data from a single perspective and
therefore suffer from occlusions. We overcome this issue with our novel
multi-view 6D pose estimation method called MV6D which accurately predicts the
6D poses of all objects in a cluttered scene based on RGB-D images from
multiple perspectives. We base our approach on the PVN3D network that uses a
single RGB-D image to predict keypoints of the target objects. We extend this
approach by using a combined point cloud from multiple views and fusing the
images from each view with a DenseFusion layer. In contrast to current
multi-view pose detection networks such as CosyPose, our MV6D can learn the
fusion of multiple perspectives in an end-to-end manner and does not require
multiple prediction stages or subsequent fine-tuning of the prediction.
Furthermore, we present three novel photorealistic datasets of cluttered scenes
with heavy occlusions. All of them contain RGB-D images from multiple
perspectives and the ground truth for instance semantic segmentation and 6D
pose estimation. MV6D significantly outperforms the state-of-the-art in
multi-view 6D pose estimation even in cases where the camera poses are known
inaccurately. Furthermore, we show that our approach is robust to dynamic
camera setups and that its accuracy increases with the number of perspectives.
Comment: Accepted at IROS 202
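The fusion described above starts from a combined point cloud: each view's depth map is back-projected through the camera intrinsics and transformed into a common world frame using the (possibly inaccurate) camera poses. A minimal sketch of that aggregation step, assuming known pinhole intrinsics K and camera-to-world matrices; function names here are illustrative and not MV6D's actual API:

```python
import numpy as np

def depth_to_points(depth, K):
    """Back-project a depth map into camera-frame 3D points (pinhole model)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def fuse_views(depths, intrinsics, cam_to_world):
    """Merge per-view point clouds into a single world-frame cloud."""
    clouds = []
    for depth, K, T in zip(depths, intrinsics, cam_to_world):
        pts = depth_to_points(depth, K)
        # Homogeneous coordinates, then apply the 4x4 camera-to-world pose.
        pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
        clouds.append((pts_h @ T.T)[:, :3])
    return np.concatenate(clouds, axis=0)
```

In the actual network, per-view RGB features are additionally fused with this geometry (via a DenseFusion layer) before the point-wise keypoint voting.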
FusionVAE: A Deep Hierarchical Variational Autoencoder for RGB Image Fusion
Sensor fusion can significantly improve the performance of many computer
vision tasks. However, traditional fusion approaches are either not data-driven,
and thus can neither exploit prior knowledge nor find regularities in a given
dataset, or they are restricted to a single application. We overcome this shortcoming by
presenting a novel deep hierarchical variational autoencoder called FusionVAE
that can serve as a basis for many fusion tasks. Our approach is able to
generate diverse image samples that are conditioned on multiple noisy,
occluded, or only partially visible input images. We derive and optimize a
variational lower bound for the conditional log-likelihood of FusionVAE. In
order to assess the fusion capabilities of our model thoroughly, we created
three novel datasets for image fusion based on popular computer vision
datasets. In our experiments, we show that FusionVAE learns a representation of
aggregated information that is relevant to fusion tasks. The results
demonstrate that our approach outperforms traditional methods significantly.
Furthermore, we present the advantages and disadvantages of different design
choices.
Comment: Accepted at ECCV 202
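The variational lower bound mentioned above is the conditional analogue of the standard ELBO: for target image x, noisy input images y, and latent z, a generic single-latent sketch (FusionVAE's hierarchical bound stacks several latent levels, but the structure per level is of this form):

```latex
\log p_\theta(x \mid y)
  \;\ge\; \mathbb{E}_{q_\phi(z \mid x,\, y)}\!\left[ \log p_\theta(x \mid z,\, y) \right]
  \;-\; D_{\mathrm{KL}}\!\left( q_\phi(z \mid x,\, y) \,\|\, p_\theta(z \mid y) \right)
```

Maximizing this bound trades off reconstruction quality of the fused output against keeping the conditional posterior close to the prior, which is what lets the model generate diverse samples conditioned on the inputs.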